LT TTT - A Flexible Tokenisation Tool

Authors

Claire Grover

Colin Matheson

Andrei Mikheev

Marc Moens

Abstract

We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation of a corpus in the medical domain. We conclude with a discussion of the use of browsers to visualise marked-up text.

Similar resources

Text Tokenisation Using unitok

This paper presents unitok, a tool for tokenisation of text in many languages. Although a simple idea – exploiting spaces in the text to separate tokens – works well most of the time, the rest of observed cases is quite complicated, language dependent and requires a special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some language or web data ...

Tools to Address the Interdependence between Tokenisation and Standoff Annotation

In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use standoff annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities because it provides appropriate units fo...

An Efficient and Flexible Format for Linguistic and Semantic Annotation

The paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was designed specifically for the purposes of Cross-Lingual Information Retrieval in the medical domain so as to allow both efficient and flexible access to layers of information. We use a parallel English-German corpus of medical abstracts and annotate it with linguistic informati...

Maca — a configurable tool to integrate Polish morphological data

There are a number of morphological analysers for Polish. Most of these, however, are non-free resources. What is more, different analysers employ different tagsets and tokenisation strategies. This situation calls for a simple and universal framework to join different sources of morphological information, including the existing resources as well as user-provided dictionaries. We present such a...

Tsukuba Termination Tool

We present a tool for automatically proving termination of first-order rewrite systems. The tool is based on the dependency pair method of Arts and Giesl [1]. It incorporates several new ideas that make the method more efficient. The tool produces high-quality output and has a convenient web interface. If TTT succeeds in proving termination, it outputs a proof script which explains in considera...

Publication date: 2000
